ing the Needleman-Wunsch algorithm and the Smith-Waterman
algorithm. The two groups of scores show a significant difference. Therefore,
the alignment-based sequence comparison approach does have the
ation power to separate species based on the sequence homology
alignment scores. This also demonstrates that the sequence structure
ation determines the species differentiation.
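To make the difference between the two alignment modes concrete, the following is a minimal sketch of the dynamic programming recurrence behind the Needleman-Wunsch (global) and Smith-Waterman (local) scores. The function name `alignment_score`, the scoring values (match = 1, mismatch = -1, gap = -1) and the linear gap penalty are illustrative assumptions, not the settings used to produce the scores discussed above.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Return the optimal alignment score of sequences a and b.

    local=False gives a Needleman-Wunsch style global score,
    local=True gives a Smith-Waterman style local score.
    The scoring values are assumptions chosen for illustration.
    """
    cols = len(b) + 1
    # Global alignment pays for leading gaps; local alignment starts at zero.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                        prev[j] + gap,      # gap in b
                        curr[j - 1] + gap)  # gap in a
            if local:
                score = max(score, 0)       # a local alignment may restart anywhere
                best = max(best, score)
            curr[j] = score
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    print(alignment_score("ACGT", "TTACGTTT"))              # global score
    print(alignment_score("ACGT", "TTACGTTT", local=True))  # local score
```

With these assumed scores, a short sequence embedded in a longer one is penalised for the flanking gaps in the global mode but not in the local mode, which is why the two algorithms can report different alignment lengths for the same pair.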
The results of aligning MT042778 (SARS-CoV-2) with AB889999
V) and of aligning MT042778 with QANK0100268 (Yersinia pestis) using the
alignment-based sequence comparison approach, i.e., the Needleman-Wunsch algorithm
and the Smith-Waterman algorithm.
                          Align with AB889999      Align with QANK01002681
                          Needleman    Smith       Needleman    Smith
Alignment length          143          113         327          259
Identical pairs           109          109         167          129
Identity percentage (%)   76.2         96.5        51.1         49.8
percentage (%)            21.7         0.9         42.8         20.5
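As a quick arithmetic check, the identity percentages reported in the table are consistent with dividing the number of identical pairs by the alignment length. The sketch below recomputes them from the reported counts; the function name `identity_percentage` and the dictionary layout are illustrative assumptions.

```python
def identity_percentage(identical_pairs, alignment_length):
    """Identity as a percentage of the alignment length, to one decimal place."""
    return round(100.0 * identical_pairs / alignment_length, 1)


# (identical pairs, alignment length) as reported in the table above
reported = {
    ("AB889999", "Needleman"): (109, 143),
    ("AB889999", "Smith"): (109, 113),
    ("QANK01002681", "Needleman"): (167, 327),
    ("QANK01002681", "Smith"): (129, 259),
}

if __name__ == "__main__":
    for (target, algorithm), (identical, length) in reported.items():
        print(target, algorithm, identity_percentage(identical, length))
```

Both algorithms find the same 109 identical pairs against AB889999, but the local alignment reports them over a shorter alignment (113 versus 143 positions), which is why its identity percentage is higher.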
The k-mers approach
ng sequences using the homology alignment approaches, either
global or local, is very costly, as mentioned earlier. The alignment-free
sequence comparison approach is therefore of great interest. Most
alignment-free sequence comparison approaches explore the pattern of the
sequence statistics for sequence comparison and are still widely used in
applications [Lichtblau, 2019; Randhawa, et al., 2020; Rohling,
20].
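As an illustration of the kind of sequence statistics such approaches rely on, the sketch below counts overlapping k-mers in a nucleotide sequence and returns their relative frequencies as a fixed-length vector. It is a minimal example under stated assumptions: the function name `kmer_features`, the default alphabet "ACGT", the lexicographic ordering of k-mers and the normalisation to relative frequencies are choices made here for illustration and are not taken from the cited works.

```python
from itertools import product


def kmer_features(seq, k=2, alphabet="ACGT"):
    """Map a sequence to a vector of k-mer relative frequencies.

    The vector has len(alphabet) ** k entries, one per possible k-mer,
    ordered as generated by itertools.product over the alphabet.
    """
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip windows with symbols outside the alphabet, e.g. N
            counts[index[kmer]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    features = kmer_features("ACGTACGTTTGA", k=2)
    print(len(features))  # 16 possible dinucleotides
    print(features)
```

Because every sequence is mapped to a vector of the same length regardless of its own length, two sequences can be compared with an ordinary vector distance without aligning them, which is where the computational saving comes from.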
One of the basic principles of the alignment-free sequence comparison
approach is to use the pattern of the sequence statistics to represent
sequences [Le and Huynh, 2019; Nguyen, et al., 2019; Guo, et al., 2020].
This process is called a feature extraction process. Suppose the nth
sequence is denoted by s. The mapping between s and a feature vector
is formulated by the following equation, where C is a set of the nucleic
acids or a set of amino acids, ℓ stands for the length of the nth sequence
nds for the feature space dimension,